# CSC 2224: Parallel Computer Architecture and Programming Advanced Memory

Prof. Gennady Pekhimenko University of Toronto Fall 2019

The content of this lecture is adapted from the slides of Vivek Seshadri, Yoongu Kim, and lectures of Onur Mutlu @ ETH and CMU

# **Review #5**

## Flipping Bits in Memory Without Accessing Them

Yoongu Kim et al., ISCA 2014



#### Memory latency remains almost constant

### We Need A Paradigm Shift To ...

Enable computation with minimal data movement

Compute where it makes sense (where data resides)

Make computing architectures more data-centric

### Processing Inside Memory



### Why In-Memory Computation Today?

#### Pull from Systems and Applications

- Data access is a major system and application bottleneck
- Systems are energy limited
- Data movement much more energy-hungry than computation


# Two Approaches to In-Memory Processing

1. Minimally change DRAM to enable simple yet powerful computation primitives
   - RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data (Seshadri et al., MICRO 2013)
   - Fast Bulk Bitwise AND and OR in DRAM (Seshadri et al., IEEE CAL 2015)
   - Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses (Seshadri et al., MICRO 2015)
2. Exploit the control logic in 3D-stacked memory to enable more comprehensive computation near memory
   - PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture (Ahn et al., ISCA 2015)
   - A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing (Ahn et al., ISCA 2015)
   - Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation (Hsieh et al., ICCD 2016)

# Approach 1: Minimally Changing DRAM

- DRAM has great capability to perform bulk data movement and computation internally with small changes
  - Can exploit internal bandwidth to move data
  - Can exploit analog computation capability


- Examples: RowClone, In-DRAM AND/OR, Gather/Scatter DRAM
  - RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data (Seshadri et al., MICRO 2013)
  - Fast Bulk Bitwise AND and OR in DRAM (Seshadri et al., IEEE CAL 2015)
  - Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses (Seshadri et al., MICRO 2015)

### Starting Simple: Data Copy and Initialization

# Bulk Data Copy



# Bulk Data Initialization



### Bulk Data Copy and Initialization

Prior work on bulk data copy and initialization:

- "The Impact of Architectural Trends on Operating System Performance" (Mendel Rosenblum, Edouard Bugnion, Stephen Alan Herrod, Emmett Witchel, and Anoop Gupta)
- "Hardware Support for Bulk Data Movement in Server Platforms" (Li Zhao, Ravi Iyer, Srihari Makineni, Laxmi Bhuyan, and Don Newell; University of California, Riverside, and Communications Technology Lab, Intel)
- "Architecture Support for Improving Bulk Memory Copying and Initialization Performance" (Xiaowei Jiang and Yan Solihin, North Carolina State University; Li Zhao and Ravishankar Iyer, Intel Labs)

memmove & memcpy: 5% of cycles in Google's datacenter [Kanev+, ISCA]



# Today's Systems: Bulk Data Copy



1046ns, 3.6uJ (for 4KB page copy via DMA)
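As a rough back-of-the-envelope check, the two figures above can be turned into an effective copy bandwidth and a per-byte energy cost. This sketch assumes that "4KB" means 4096 bytes; the function names are illustrative only.

```python
PAGE_BYTES = 4096          # assumption: 4 KB = 4096 bytes
COPY_LATENCY_NS = 1046.0   # latency quoted on the slide
COPY_ENERGY_UJ = 3.6       # energy quoted on the slide


def effective_bandwidth_gb_per_s(num_bytes: int, latency_ns: float) -> float:
    """Bytes per nanosecond is numerically equal to 10^9 bytes per second."""
    return num_bytes / latency_ns


def energy_per_byte_nj(num_bytes: int, energy_uj: float) -> float:
    """Convert microjoules to nanojoules, then divide by the bytes copied."""
    return energy_uj * 1000.0 / num_bytes


bandwidth = effective_bandwidth_gb_per_s(PAGE_BYTES, COPY_LATENCY_NS)
energy = energy_per_byte_nj(PAGE_BYTES, COPY_ENERGY_UJ)
print(f"{bandwidth:.2f} GB/s, {energy:.2f} nJ/byte")
```

Under these assumptions the copy proceeds at a bit under 4 GB/s and costs a bit under 1 nJ per byte copied.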

# Future Systems: In-Memory Copy



### **RowClone: In-DRAM Row Copy**



## RowClone: Intra-Subarray



# RowClone: Intra-Subarray (II)



1. Activate src row (copy data from src to row buffer)
2. Activate dst row (disconnect src from row buffer, connect dst; copy data from row buffer to dst)
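The two back-to-back activations can be illustrated with a toy software model. This is only a sketch: the `Subarray` class and its methods are hypothetical names, and the model ignores timing, sense-amplifier analog behavior, and everything else outside the copy itself.

```python
class Subarray:
    """Toy model of one DRAM subarray: rows of bits plus a shared row buffer."""

    def __init__(self, num_rows: int, row_bits: int) -> None:
        self.rows = [[0] * row_bits for _ in range(num_rows)]
        self.row_buffer = None  # None models the precharged (empty) state

    def precharge(self) -> None:
        self.row_buffer = None

    def activate(self, row: int) -> None:
        if self.row_buffer is None:
            # Normal activation: the row buffer latches the row's data.
            self.row_buffer = list(self.rows[row])
        else:
            # Activation without an intervening precharge: the data already
            # latched in the row buffer overwrites the newly connected row.
            self.rows[row] = list(self.row_buffer)


def rowclone_intra_subarray(sub: Subarray, src: int, dst: int) -> None:
    """Copy src to dst with two back-to-back activations."""
    sub.precharge()
    sub.activate(src)   # step 1: src -> row buffer
    sub.activate(dst)   # step 2: row buffer -> dst
    sub.precharge()


sub = Subarray(num_rows=4, row_bits=8)
sub.rows[1] = [1, 0, 1, 1, 0, 0, 1, 0]
rowclone_intra_subarray(sub, src=1, dst=3)
print(sub.rows[3])
```

The point of the model is that no data ever leaves the subarray: the row buffer is the only intermediate storage.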

## **RowClone: Inter-Bank**



**Overlap the latency of the read and the write**

**1.9X latency reduction, 3.2X energy reduction**

### Generalized RowClone

0.01% area cost



### **RowClone: Fast Row Initialization**



Fix a row at Zero (0.5% loss in capacity)

#### RowClone: Bulk Initialization

#### Initialization with arbitrary data

- Initialize one row
- Copy the data to other rows
- Zero initialization (most common)
  - Reserve a row in each subarray (always zero)
  - Copy data from reserved row (FPM mode)
  - **6.0X** lower latency, **41.5X** lower DRAM energy
  - 0.2% loss in capacity
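The capacity cost of reserving one always-zero row per subarray is simple arithmetic. The sketch below assumes 512 rows per subarray, a value that is not stated on the slide; with that assumption the result matches the 0.2% quoted above.

```python
def reserved_row_overhead(rows_per_subarray: int, reserved_rows: int = 1) -> float:
    """Fraction of capacity lost when some rows per subarray are reserved."""
    if rows_per_subarray <= 0 or not 0 <= reserved_rows <= rows_per_subarray:
        raise ValueError("invalid row counts")
    return reserved_rows / rows_per_subarray


ROWS_PER_SUBARRAY = 512  # assumption, not taken from the slide
overhead = reserved_row_overhead(ROWS_PER_SUBARRAY)
print(f"{overhead:.2%}")
```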

#### RowClone: Latency & Energy Benefits



### Copy and Initialization in Workloads



## RowClone: Application Performance



# End-to-End System Design



How to communicate occurrences of bulk copy/initialization across layers?

How to ensure cache coherence?

How to maximize latency and energy savings?

How to handle data reuse?

# Ambit

### In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology

#### Vivek Seshadri

Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, Todd C. Mowry



# **Executive Summary**

- Problem: Bulk bitwise operations
  - present in many applications, e.g., databases, search filters
  - existing systems are memory bandwidth limited
- Our Proposal: Ambit
  - perform bulk bitwise operations completely inside DRAM
  - bulk bitwise AND/OR: simultaneous activation of three rows
  - bulk bitwise NOT: inverters already in sense amplifiers
  - less than 1% area overhead over existing DRAM chips
- Results compared to state-of-the-art baseline
  - average across seven bulk bitwise operations
    - 32X performance improvement, 35X energy reduction
  - 3X-7X performance for real-world data-intensive applications



[1] Li and Patel, BitWeaving, SIGMOD 2013

[2] Goodwin+, BitFunnel, SIGIR 2017

# Today, DRAM is just a storage device!



Throughput of bulk bitwise operations limited by available memory bandwidth

# **Our Approach**



# Inside a DRAM Chip



# **DRAM Cell Operation**





### **Triple-Row Activation: Majority Function**



### **Bitwise AND/OR Using Triple-Row Activation**
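Triple-row activation leaves each bitline at the majority value of the three connected cells. A minimal sketch of why that yields AND and OR: fixing the third input to 0 turns the majority function into AND, and fixing it to 1 turns it into OR. The function name is illustrative.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority of three equally wide bit vectors."""
    return (a & b) | (b & c) | (c & a)


# Exhaustive check over single bits: the control bit c selects AND or OR.
for a in (0, 1):
    for b in (0, 1):
        assert majority(a, b, 0) == (a & b)
        assert majority(a, b, 1) == (a | b)
        print(a, b, "AND:", majority(a, b, 0), "OR:", majority(a, b, 1))
```

Because the function is bitwise, the same identity holds for every bit position of a full DRAM row at once.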





# Bulk Bitwise AND/OR in DRAM

Statically reserve three designated rows t1, t2, and t3

**Result = row A AND/OR row B**

1. Copy data of row A to row t1 (RowClone)
2. Copy data of row B to row t2 (RowClone)
3. Initialize data of row t3 to 0/1 (RowClone)
4. Activate rows t1/t2/t3 simultaneously
5. Copy data of row t1/t2/t3 to Result row (RowClone)

#### Use RowClone to perform copy and initialization operations completely in DRAM!

RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization (MICRO). Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry (Carnegie Mellon University, Intel Pittsburgh)
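The five-step procedure on this slide can be sketched in software. Rows are modeled as Python integers in a dictionary; `ROW_BITS`, `triple_row_activate`, and `bulk_and_or` are hypothetical names chosen for the illustration, and the copies stand in for RowClone operations.

```python
ROW_BITS = 16                      # hypothetical row width for the sketch
ALL_ONES = (1 << ROW_BITS) - 1


def triple_row_activate(rows: dict, r1: str, r2: str, r3: str) -> None:
    """All three rows end up holding the bitwise majority of their contents."""
    a, b, c = rows[r1], rows[r2], rows[r3]
    result = (a & b) | (b & c) | (c & a)
    rows[r1] = rows[r2] = rows[r3] = result


def bulk_and_or(rows: dict, src_a: str, src_b: str, dst: str, op: str) -> None:
    if op not in ("and", "or"):
        raise ValueError("op must be 'and' or 'or'")
    rows["t1"] = rows[src_a]                      # step 1: copy A to t1
    rows["t2"] = rows[src_b]                      # step 2: copy B to t2
    rows["t3"] = ALL_ONES if op == "or" else 0    # step 3: initialize t3 to 1/0
    triple_row_activate(rows, "t1", "t2", "t3")   # step 4: activate t1/t2/t3
    rows[dst] = rows["t1"]                        # step 5: copy result out


rows = {"A": 0b1100110011001100, "B": 0b1010101010101010, "R": 0}
bulk_and_or(rows, "A", "B", "R", "and")
print(format(rows["R"], "016b"))   # 1000100010001000
bulk_and_or(rows, "A", "B", "R", "or")
print(format(rows["R"], "016b"))   # 1110111011101110
```

In this model the activation overwrites all three participating rows, which is why the sketch operates on the reserved copies t1, t2, and t3 and leaves rows A and B untouched.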

### **Negation Using the Sense Amplifier**




#### Ambit vs. DDR3: Performance and Energy



# Integrating Ambit with the System

## **1. PCIe device**

- Similar to other accelerators (e.g., GPU)

## 2. System memory bus

- Ambit uses the same DRAM command/address interface

#### Pros and cons discussed in paper (Section 5.4)

## **Real-world Applications**

- Methodology (Gem5 simulator)
  - Processor: x86, 4 GHz, out-of-order, 64-entry instruction queue
  - L1 cache: 32 KB D-cache and 32 KB I-cache, LRU policy
  - L2 cache: 2 MB, LRU policy
  - Memory controller: FR-FCFS, 8 KB row size
  - Main memory: DDR4-2400, 1 channel, 1 rank, 8 banks
- Workloads
  - Database bitmap indices
  - BitWeaving: column scans using bulk bitwise operations
  - Set operations comparing bitvectors with red-black trees

## **Bitmap Indices: Performance**



Consistent reduction in execution time. 6X on average


#### Speedup offered by Ambit for BitWeaving

select count(\*) where c1 < field < c2
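To see why such a range query reduces to bulk bitwise operations, here is a bit-sliced scan in the spirit of BitWeaving's vertical layout. It is a simplified sketch rather than the algorithm from the BitWeaving paper: all function names are illustrative, Python integers stand in for long bit vectors, and the constants are assumed to satisfy `0 <= c < 2**width`.

```python
def to_bit_slices(values, width):
    """slices[i] holds bit i (0 = LSB) of every value; bit j belongs to values[j]."""
    slices = [0] * width
    for j, v in enumerate(values):
        for i in range(width):
            if (v >> i) & 1:
                slices[i] |= 1 << j
    return slices


def less_than(slices, const, n):
    """Bit j of the result is set iff values[j] < const."""
    mask = (1 << n) - 1
    lt, eq = 0, mask                      # eq: prefix still equal to const
    for i in reversed(range(len(slices))):
        x = slices[i]
        if (const >> i) & 1:
            lt |= eq & ~x & mask          # value has 0 where const has 1
            eq &= x
        else:
            eq &= ~x & mask
    return lt


def greater_than(slices, const, n):
    """Bit j of the result is set iff values[j] > const."""
    mask = (1 << n) - 1
    gt, eq = 0, mask
    for i in reversed(range(len(slices))):
        x = slices[i]
        if (const >> i) & 1:
            eq &= x
        else:
            gt |= eq & x                  # value has 1 where const has 0
            eq &= ~x & mask
    return gt


def count_between(values, c1, c2, width):
    """Evaluate: select count(*) where c1 < field < c2."""
    slices = to_bit_slices(values, width)
    n = len(values)
    hits = greater_than(slices, c1, n) & less_than(slices, c2, n)
    return bin(hits).count("1")


print(count_between([3, 7, 10, 12, 15, 0, 8], c1=3, c2=12, width=4))  # 3
```

Each loop iteration touches one whole bit slice with AND, OR, and NOT, which is exactly the kind of operation Ambit executes on entire DRAM rows.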



#### FLIPPING BITS IN MEMORY WITHOUT ACCESSING THEM ISCA 2014



# ROW HAMMER

[Figure: DRAM rows, with an aggressor row whose wordline voltage is toggled and adjacent victim rows]

**READ DATA FROM HERE, GET ERRORS OVER THERE**

# GOOGLE'S EXPLOIT

#### Project Zero

News and updates from the Project Zero team at Google

Monday, March 9, 2015

Exploiting the DRAM rowhammer bug to gain kernel privileges

"We learned about rowhammer from Yoongu Kim et al."

http://googleprojectzero.blogspot.com



#### **REAL SYSTEM**

MANY READS TO SAME ADDRESS

OPEN/CLOSE SAME ROW

1. CACHE HITS
2. ROW HITS

| x86 CPU | DRAM |
|---------|------|
| LOOP:<br>mov (X), %reg<br>mov (Y), %reg<br>clflush (X)<br>clflush (Y)<br>jmp LOOP | MANY ERRORS! |

http://www.github.com/CMU-SAFARI/rowhammer
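The role of the two obstacles (cache hits and row hits) can be illustrated with a small model that counts row activations in one bank. This is a sketch under stated assumptions: an open-row policy, X and Y mapped to different rows of the same bank, and `clflush` modeled as the data never staying cached. It does not hammer real memory, and `count_activations` is a hypothetical helper.

```python
def count_activations(accesses, row_of, flush_after_access):
    """Count row activations in one DRAM bank for a stream of reads.

    accesses: iterable of address labels
    row_of: dict mapping each address label to its DRAM row
    flush_after_access: if True, model clflush by never keeping data cached
    """
    cached = set()      # addresses currently held in the CPU cache
    open_row = None     # row currently latched in the bank's row buffer
    activations = 0
    for addr in accesses:
        if addr in cached:
            continue                    # cache hit: DRAM is never accessed
        row = row_of[addr]
        if row != open_row:
            activations += 1            # row miss: close old row, open new one
            open_row = row
        if not flush_after_access:
            cached.add(addr)
    return activations


rows = {"X": 0, "Y": 1}                 # hypothetical: X and Y in different rows
hammer = ["X", "Y"] * 1000
print(count_activations(hammer, rows, flush_after_access=True))        # 2000
print(count_activations(hammer, rows, flush_after_access=False))       # 2
print(count_activations(["X"] * 2000, rows, flush_after_access=True))  # 1
```

Only the combination used in the loop above (flushing plus alternating between two rows) turns every read into a row activation.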

# WHY DO THE ERRORS OCCUR?

# DRAM CELLS ARE LEAKY







# **ROOT CAUSE?**



# COUPLING

- Electromagnetic
- Tunneling

ACCELERATES CHARGE LOSS

# AS DRAM SCALES

- CELLS BECOME SMALLER: less tolerance to coupling effects
- CELLS BECOME PLACED CLOSER: stronger coupling effects

# 1. ERRORS ARE RECENT

- Not found in pre-2010 chips

# 2. ERRORS ARE WIDESPREAD

- >80% of chips have errors
- Up to one error per ~1K cells





# MOST MODULES AT RISK

- A VENDOR (37/43)
- B VENDOR
- C VENDOR (28/32)

[Figure: errors per 10<sup>9</sup> cells (log scale, 10<sup>0</sup> to 10<sup>6</sup>) versus module manufacture date (2008 to 2014), by vendor]

# DISTURBING FACTS

- AFFECTS ALL VENDORS
  - Not an isolated incident
  - Deeper issue in DRAM scaling
- UNADDRESSED FOR YEARS

# HOW TO PREVENT COUPLING

Previous Approaches?

1. Make Better Chips: Expensive
2. Rigorous Testing: Takes Too Long

[Figure: frequent refresh of rows]

# TWO NAIVE SOLUTIONS

1. LIMIT ACCESSES TO ROW: Access Interval > 500ns
2. REFRESH ALL ROWS

LARGE OVERHEAD: PERF, ENERGY, COMPLEXITY



# **PARR: CHANCE OF ERROR**

- NO REFRESHES IN N TRIALS: Probability $0.999^{N}$
- N = 128K FOR ERROR (64ms): Probability $0.999^{128\mathrm{K}} = 10^{-56}$

| STRONG RELIABILITY | LOW PERF OVERHEAD | NO STORAGE OVERHEAD |
|--------------------|-------------------|---------------------|
| $9.4 \times 10^{-14}$ Errors/Year | 0.20% Slowdown | 0 Bytes |
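The probability on this slide is easy to check numerically. The sketch below assumes a per-trial refresh probability of 0.001, inferred from the 0.999 base, and evaluates both readings of "128K" because the slide does not say whether K means 1000 or 1024. The function name is illustrative.

```python
import math


def log10_prob_no_refresh(p_refresh: float, trials: int) -> float:
    """Return log10 of (1 - p_refresh) ** trials, computed in log space."""
    return trials * math.log10(1.0 - p_refresh)


P_REFRESH = 0.001  # assumption inferred from the 0.999 on the slide
for label, n in (("K = 1000", 128 * 1000), ("K = 1024", 128 * 1024)):
    exponent = log10_prob_no_refresh(P_REFRESH, n)
    print(f"N = 128K ({label}): probability ~ 10^{exponent:.1f}")
```

Either reading gives a probability on the order of $10^{-56}$ to $10^{-57}$, in line with the figure quoted above.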

# **RELATED WORK**

- Security Exploit (Seaborn@Google 2015)
- Industry Analysis (Kang@SK Hynix 2014)
  - "... will be [more] severe as technology shrinks down."
- Targeted Row Refresh (JEDEC 2014)
- DRAM Testing (e.g., Van de Goor+)

# Emerging Memory Technologies

### Limits of Charge Memory

- Difficult charge placement and control
  - Flash: floating gate charge
  - DRAM: capacitor charge, transistor leakage
- Reliable sensing becomes difficult as charge storage unit size reduces



### Charge vs. Resistive Memories

- Charge Memory (e.g., DRAM, Flash)
  - Write data by capturing charge Q
  - Read data by detecting voltage V

- Resistive Memory (e.g., PCM, STT-MRAM, memristors)
  - Write data by pulsing current dQ/dt
  - Read data by detecting resistance R

# Promising Resistive Memory Technologies

#### PCM

- Inject current to change material phase
- Resistance determined by phase

#### STT-MRAM

- Inject current to change magnet polarity
- Resistance determined by polarity
#### Memristors/RRAM/ReRAM

- Inject current to change atomic structure
- Resistance determined by atom distance

# What is Phase Change Memory?

- Phase change material (chalcogenide glass) exists in two states:
  - Amorphous: Low optical reflectivity and high electrical resistivity
  - Crystalline: High optical reflectivity and low electrical resistivity



- PCM is resistive memory: High resistance (0), Low resistance (1)
- PCM cell can be switched between states reliably and quickly

# How Does PCM Work?

- Write: change phase via current injection
  - SET: sustained current to heat cell above $T_{cryst}$
  - RESET: cell heated above $T_{melt}$ and quenched
- Read: detect phase via material resistance
  - amorphous/crystalline





Photo Courtesy: Bipin Rajendran, IBM

Slide Courtesy: Moinuddin Qureshi, IBM

## Opportunity: PCM Advantages

#### Scales better than DRAM, Flash

- Requires current pulses, which scale linearly with feature size
- Expected to scale to 9nm (2022 [ITRS])
- Prototyped at 20nm (Raoux+, IBM JRD 2008)

#### Can be denser than DRAM

- Can store multiple bits per cell due to large resistance range
- Prototypes with 2 bits/cell in ISSCC'08, 4 bits/cell by 2012

#### Non-volatile

- Retains data for >10 years at 85°C

#### No refresh needed, low idle power

# **PCM** Resistance $\rightarrow$ Value



# Multi-Level Cell PCM

- Multi-level cell: more than 1 bit per cell
  - Further increases density by 2 to 4x [Lee+,ISCA'09]
- But MLC-PCM also has drawbacks
  - Higher latency and energy than single-level cell PCM



# **MLC-PCM** Resistance $\rightarrow$ Value

Less margin between values

→ need more precise sensing/modification of cell contents
→ higher latency/energy (~2x for reads and 4x for writes)
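The shrinking margin can be made concrete with a toy model that spaces the cell's resistance levels evenly on a logarithmic scale. Everything numeric here is hypothetical: the 1 kΩ to 1 MΩ range, the even spacing, and the convention that the lowest resistance decodes to all ones are assumptions for illustration, not PCM device data.

```python
import math


def level_centers(r_min: float, r_max: float, bits: int) -> list:
    """Evenly spaced level centers in log10(resistance), lowest first."""
    levels = 2 ** bits
    lo, hi = math.log10(r_min), math.log10(r_max)
    step = (hi - lo) / (levels - 1)
    return [lo + k * step for k in range(levels)]


def margin_decades(r_min: float, r_max: float, bits: int) -> float:
    """Distance between adjacent level centers, in decades of resistance."""
    centers = level_centers(r_min, r_max, bits)
    return centers[1] - centers[0]


def decode(resistance: float, r_min: float, r_max: float, bits: int) -> int:
    """Map a sensed resistance to the nearest level's stored value.

    Assumed convention: lowest resistance -> all ones, highest -> all zeros.
    """
    centers = level_centers(r_min, r_max, bits)
    x = math.log10(resistance)
    nearest = min(range(len(centers)), key=lambda k: abs(centers[k] - x))
    return (len(centers) - 1) - nearest


R_MIN, R_MAX = 1e3, 1e6   # hypothetical resistance range in ohms
for bits in (1, 2, 3):
    print(bits, "bit(s)/cell:", round(margin_decades(R_MIN, R_MAX, bits), 3), "decades")
```

With the range fixed, every extra bit per cell divides the spacing between neighboring levels, so sensing and programming must be correspondingly more precise.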



### Phase Change Memory Properties

- Surveyed prototypes from 2003-2008 (ITRS, IEDM, VLSI, ISSCC)
- Derived PCM parameters for F=90nm

- Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.
- Lee et al., "Phase Change Technology and the Future of Main Memory," IEEE Micro Top Picks 2010.

### Phase Change Memory Properties: Latency

Latency comparable to, but slower than DRAM



### Phase Change Memory Properties

- Dynamic Energy
  - 40 uA Rd, 150 uA Wr
  - 2-43x DRAM, 1x NAND Flash
- Endurance
  - Writes induce phase change at 650C
  - Contacts degrade from thermal expansion/contraction
  - 10<sup>8</sup> writes per cell
  - 10<sup>-8</sup>x DRAM, 10<sup>3</sup>x NAND Flash
- Cell Size
  - 9-12F<sup>2</sup> using BJT, single-level cells
  - 1.5x DRAM, 2-3x NAND (will scale with feature size, MLC)

### Phase Change Memory: Pros and Cons

- Pros over DRAM
  - Better technology scaling (capacity and cost)
  - Non-volatile (persistent)
  - Low idle power (no refresh)
- Cons
  - Higher latencies: ~4-15x DRAM (especially write)
  - Higher active energy: ~2-50x DRAM (especially write)
  - Lower endurance (a cell dies after ~10<sup>8</sup> writes)
  - Reliability issues (resistance drift)
- Challenges in enabling PCM as DRAM replacement/helper:
  - Mitigate PCM shortcomings
  - Find the right way to place PCM in the system

# PCM-based Main Memory (I)

How should PCM-based (main) memory be organized?



- Hybrid PCM+DRAM [Qureshi+ ISCA'09, Dhiman+ DAC'09]:
  - How to partition/migrate data between PCM and DRAM

# PCM-based Main Memory (II)

How should PCM-based (main) memory be organized?



- Pure PCM main memory [Lee et al., ISCA'09, Top Picks'10]:
  - How to redesign entire hierarchy (and cores) to overcome PCM shortcomings

# An Initial Study: Replace DRAM with PCM

- Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.
  - Surveyed prototypes from 2003-2008 (e.g. IEDM, VLSI, ISSCC)
  - Derived "average" PCM parameters for F=90nm

#### Density

- 9-12F<sup>2</sup> using BJT
- 1.5× DRAM

#### Endurance

- 1E+08 writes per cell
- 1E-08× DRAM

#### Latency

- 50ns Rd, 150ns Wr
- 4×, 12× DRAM

#### Energy

- 40µA Rd, 150µA Wr
- 2×, 43× DRAM

### Results: Naïve Replacement of DRAM with PCM

- Replace DRAM with PCM in a 4-core, 4MB L2 system
- PCM organized the same as DRAM: row buffers, banks, peripherals
- 1.6x delay, 2.2x energy, 500-hour average lifetime





 Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.

# Architecting PCM to Mitigate Shortcomings

- Idea 1: Use multiple narrow row buffers in each PCM chip
  - Reduces array reads/writes → better endurance, latency, energy
- Idea 2: Write into array at cache block or word granularity
  - Reduces unnecessary wear
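Idea 2 can be illustrated by counting how many bytes reach the PCM array when a buffer is evicted, as a function of write granularity. This is a minimal sketch: the 2048-byte buffer, the dirty offsets, and the helper name `array_bytes_written` are hypothetical, and the model tracks only which chunks contain modified bytes.

```python
def array_bytes_written(buffer_bytes: int, dirty_offsets, granularity_bytes: int) -> int:
    """Bytes written to the array on eviction.

    Only chunks of `granularity_bytes` that contain at least one dirty byte
    are written back; clean chunks cause no wear.
    """
    if buffer_bytes % granularity_bytes != 0:
        raise ValueError("granularity must divide the buffer size")
    if any(not 0 <= off < buffer_bytes for off in dirty_offsets):
        raise ValueError("dirty offset outside the buffer")
    dirty_chunks = {off // granularity_bytes for off in dirty_offsets}
    return len(dirty_chunks) * granularity_bytes


BUFFER_BYTES = 2048          # hypothetical buffer size
dirty = [0, 70]              # hypothetical: two modified bytes
for granularity in (2048, 64, 4):
    written = array_bytes_written(BUFFER_BYTES, dirty, granularity)
    print(f"granularity {granularity:4d} B -> {written} B written")
```

In this example, moving from whole-buffer writes to cache-block or word granularity cuts the bytes written, and hence the wear, by one to two orders of magnitude.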



### Results: Architected PCM as Main Memory

- 1.2x delay, 1.0x energy, 5.6-year average lifetime
- Scaling improves energy, endurance, density



- Caveat 1: Worst-case lifetime is much shorter (no guarantees)
- Caveat 2: Intensive applications see large performance and energy hits
- Caveat 3: Optimistic PCM parameters?

### PCM As Main Memory

Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," *Proceedings of the 36th International Symposium on Computer Architecture (ISCA)*, pages 2-13, Austin, TX, June 2009.

### Architecting Phase Change Memory as a Scalable DRAM Alternative

Benjamin C. Lee† Engin Ipek† Onur Mutlu‡ Doug Burger†

†Computer Architecture Group Microsoft Research Redmond, WA {blee, ipek, dburger}@microsoft.com

‡Computer Architecture Laboratory Carnegie Mellon University Pittsburgh, PA onur@cmu.edu
